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Abstract 

We introduce mixed LICORS, an algorithm for learn- 
ing nonlinear, high-dimensional dynamics from spatio- 
temporal data which can be used for both prediction 
and simulation. Mixed LICORS extends the recent 
LICORS algorithm (Goerg and Shahzi, 2012) from 
hard clustering of predictive distributions to a non- 
parametric, EM-like soft clustering. This retains the 
asymptotic predictive optimality of LICORS, but, as 
we show in simulations, greatly improves out-of-sample 
forecasts with limited data. 

We also implement the proposed methodology in the 
R package LICORS. It is publicly available at the Com- 
prehensive R Archive Network (CRAN). 

1 Introduction 

Recently Goerg and Shalizi (2012) introduced light 
cone reconstruction of states (LICORS), a nonpara- 
metric procedure for recovering predictive states from 
spatio-temporal data. Every spatio-temporal process 
has an associated, latent prediction process, whose 
measure- valued states are the optimal local predictions 
of the future at each point in space and time. LICORS 
consistently estimates this prediction process from a 
single realization of the manifest space-time process, 
through an agglomerative clustering in the space of 
predictive distributions; estimated states are clusters. 
This converges on the minimal set of states capable of 
optimal prediction of the original process. 

Experience with other clustering problems shows 
that soft threshold often predicts much better than 
hard threshold. Famously, while fc-means (Lloyd, 1982) 
is very fast and robust, the expectation maximization 
(EM) algorithm (Dempster, Laird, and Rubin, 1977) 
in Gaussian mixture models gives better clustering re- 
sults. Moreover, the mixture model framework admits 
of a probabilistic interpretation of clustering, and the 
assignment of novel observations to clusters. 

With this inspiration, we introduce mixed LICORS, 



a soft-thresholding version of (hard) LICORS. This em- 
beds the prediction process framework of Shalizi (2003) 
in a mixture-model setting, where the predictive states 
correspond to the optimal mixing weights on extremal 
distributions, which are themselves optimized for fore- 
casting. Our proposed nonparametric EM-like algo- 
rithm then follows naturally. 

After introducing the prediction problem and fixing 
notation, we explain the mixture-model and hidden- 
state-space interpretations of predictive states (§2). 
We then present our nonparametric EM algorithm for 
estimation, with automatic selection of the number of 
predictive states (§3), and the corresponding predic- 
tion algorithm (§4). After demonstrating that mixed 
LICORS predicts better out of sample than hard- 
clustering procedures (§5), we review the proposed 
method and discuss future work (§6). 

2 A Predictive States Model for 
Spatio-temporal Processes 

We fix notation and the general set-up for pre- 
dicting spatio-temporal processes, following Shalizi 
(2003). We observe a random field X{r,t), discrete- 
or continuous-valued, at each point on a regular spa- 
tial lattice (r € S), at each moment of discrete time 
(t e T = 1 : T), for iV = |S X T| observational in all. 
The field (or system) is (c? -I- 1)1? if space S is d dimen- 
sional (plus 1 for time); a video is a (2 -t- 1)D system. 
||r — u|| is a norm (e.g., Euclidean) on the spatial coor- 
dinates r, u G S. Let N — |S| x T be the total number 
of space-time points. 

To optimally predict an unknown (future) X{r, t) 
given (past) observed data V = {X(s, ■u)}ses.neT, it 
is necessary to know P(Ar(r,<) | V). Estimating this 
conditional distribution is quite difficult if we allow 
X(y,t) to depend on all variables in T). Since events 
in a physical system can typically only propagate at 
a finite speed c, X{y, t) can only depend on a subset 
of v. We therefore only have to consider local, spatio- 
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Figure 1: PLC £"(r,t) and FLC £+ir,t) with hp = 3, 
hf = 2,c=l. 



temporal neighborhoods of (r, t) as possibly relevant for 
prediction. 

2.1 Light Cones 

The past light cone (PLC) of (r, t) is all past events 
that could have influenced (r, t) , 

L'{v, t) = {X(u, s) I s < ||r - u|l < c{t - s)} . (1) 

Analogously the future light cone (FLC) L+(r,t) is all 
future events which could be influenced by(r,t). 

In practice, PLCs and FLCs are limited to finite hori- 
zons (Zip, hf)] in combination with c they determine LC 
dimensionality. Figure f shows PLCs of size Zip = 3 and 
FLCs with hf = 2 with speed of propagation c = 1 in 
both (1 + 1)D and (2 + 1)D. 

Thus, for physically-plausible processes, we can re- 
strict ourselves to estimating the conditional distribu- 
tion of X{r,t) given its PLC (^^,(r,i)). Doing this for 
every (r, t) in space-time, however, is still unnecessar- 
ily complex. It is often the case that spatio-temporal 
patterns occur repeatedly, and furthermore different 
histories can also have similar consequences. Thus 
one need not find P{X{r,t) \ (^-,(r,t))) for each(r,t) 
separately, but can first summarize PLCs by predic- 
tively sufficient statistics, e{£~,{r,t)), and then only 
find P(X(r,i) I e{£-,{r,t))) for (usually) K ^ N dif- 
ferent statistics. 

We now construct the minimal sufficient statistic, 
e{£~{r,t)), following Shalizi (2003), to which we refer 
for some mathematical details, and references on pre- 
dictive sufficiency. 

Definition 2.1 (Equivalent configurations). The past 
configurations £'^ at (r, t) and £J at (u, s) are predic- 
tively equivalent, {£~,{r,t)) ^ (£j,(u, s)), if they have 
the same conditional distributions for FLCs, i.e., 

P (i+(r, t) I L-{v, t) = £-) = P (i+(u, s) I L-(u, s) - £ 

Let [(^~,(r,i))] be the equivalence class of (£~,(r, t)), 
the set of all past configurations and coordinates that 



predict the same future as £ does at(r,t). Let 

e(r,(r,i))^[r]. (2) 

be the function mapping each (^~,(r,i)) to its predic- 
tive equivalence class. The values e can take are pre- 
dictive states; they are the minimal statistics which are 
sufficient for predicting from L~ (Shalizi, 2003). 

Assumption 2.2 (Conditional invariance). The pre- 
dictive distribution of a PLC configuration £~ does not 
change over time or space. That is, for all r, t, all u, s, 
and all past light-cone configurations t~ , 



■,(r,t))^(r,(u,s)) 



(3) 



We may thus regard ^ as an equivalence relation 
among PLC configurations, and e as a function over 
£~ alone. 

Assumption 2.2 enables inference from a single real- 
ization of the process (as opposed to many independent 
realizations of the same process). For readability, we 
can also encode the space-time index (r, t) by a single 
index i = 1, . . . , N.-^ 

The future is independent of the past given the pre- 
dictive state (Shalizi, 2003), 



Li 



¥{L+\er,ei£-)): 
and, by construction, 

F{L+\e-)^V{Lt\e{£-)) 



(4) 



(5) 



Thus the prediction problem simplifies from finding 
P (X, I £~) for ah i to only estimating P (AT, | e(£,")) 
for K ^ N states. 

2.2 A Statistical Predictive States 
Model 

The focus on predictive distributions in constructing 
the predictive states does not prevent us from using 
them to model the joint distribution. 

Proposition 2.3. The joint pdf of Xi, . . . , Xfj satis- 
fies 



N 



'(Ai,...,Aj^)=P(M)J]P(A, 



(6) 



where M is the margin of the spatio-temporal process 

M = {A(s, u) I £-{s, u) i {A(r, t) ,(r, i) e S x T}}, 

(7) 

,-that is all points in the field that do not have a com- 
^ pletely observed PLC. 

^ Appendix A in the SM gives an explicit rule to uniquely map 
cach(r,t) to one and only one i; see Eq. (32). 
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Proof. See Supplementary Material. 



□ 



2.4 Log-likelihood of e 



Proposition 2.3 shows that the focus on conditional 
predictive distributions does not restrict the applica- 
bility of the light cone setup to forecasts alone, but 
is in fact a generative representation for any spatio- 
temporal process. 

2.2.1 Minimal Sufficient Statistical Model 

Given e(4") Eq. (6) simplifies to (omitting P (M)) 

JV 

P{X,,...,XN\M,e) = Y[v{X,\i-;e{e-)) 

i=l 

^f[F{X,\e{i-)), (8) 

i=l 

where the second equality follows by (4) . Any particu- 
lar e implicitly specifies the number of predictive states 
K, and all K predictive distributions P (X^ | e{£~)). 
However, in practice only Xi and £~ are observed; the 
mapping e is exactly what we are trying to estimate. 

2.3 Predictive States as Hidden Vari- 
ables 

Since the mapping e and the equivalence classes / pre- 
dictive states e(£~) are in a one-to-one relation, Shalizi 
(2003) and Goerg and Shalizi (2012) do not formally 
distinguish between them. Here, however, it is impor- 
tant to keep distinction the predictive state space S 
and the mapping e : £~ S. Our EM algorithm is 
based on the idea that the predictive state is a hid- 
den variable. Si, taking values in the finite state space 
S = {si, . . . , sk}, and the mixture weights of PLC £^ 
are the soft-threshold version of the mapping e{£~)- 
It is this hidden variable interpretation of predictive 
states that we use to estimate the minimal sufficient 
statistic e, the hidden state space S, and the predic- 
tive distributions P {Xi \ Si = sj). For better readabil- 
ity letP,(.) :=P(. |5, = s,). 

Using the latent variable approach (8) can be equiv- 
alently written as 

N N K 

n p I ^0 = n E 1 = (^^)- (9) 

i—l i—1 j—1 

Eq. (9) is the pdf of a if component mixture model 
with complete data, and 1 {Si — Sj) is a randomized 
version oi e : £~ i-^ Si. 



From (9) the complete data log-likelihood can be ob- 
tained as (we omit the additive constant logP(M)) 

N / K 

£{e; V, S,) = J] log 1 (5, = s,) P (X, | e{£-) = s,) 
i=i \j=i 

N K 
i=l j=l 

(10) 

where the second equality follows since 1 {Si — Sj) ^ 1 
for one and only one j, and otherwise. 

The "parameters" in (10) are e and K; Xi and £'^ 
are observed, and Si is a hidden variable. The optimal 
mapping e : — > 5 is the one that maximizes (10): 

e* = argmax£(e;2?,S'i). (H) 

e 

Without any constraints on ii' or e the maximum is 
obtained for K = N and e{£'^) = "the most faithful 
description of the data is the data".^ As this tells us 
nothing about the underlying dynamics, we must put 
some constraints on K and/or e to get a useful solution. 
For now, assume that if ^ is fixed and we only have 
to estimate e; in Section 3.3, we will give a data-driven 
procedure to choose K . 

2.5 Nonparametric Likelihood Approx- 
imation 

To solve (11) with if ^ A we need to evaluate (10) 
for candidate solutions e. Doing this directly is not 
possible since the 

i) right hand side (RHS) of (10) depends on the un- 
known Si, and 

ii) component distributions P (A^ | e{£~) = Sj) can 
not be evaluated directly unless we use a paramet- 
ric model. Since predictive distributions can have 
arbitrary shapes, we want to use nonparametric 
methods. 

We solve i) by using a nonparametric variant of the 
expectation maximization (EM) algorithm (Dempster 
et al., 1977); for ii) we follow the nonparametric EM lit- 
erature (Benaglia, Chauveau, and Hunter, 2011; Hor- 
des, Chauveau, and Vandekerkhove, 2007; Hall, Nee- 
man, Pakyari, and Elmore, 2005) and approximate 
the P {Xi I e{£~) = Sj) in the log-likelihood with kernel 
density estimators (KDEs) using a previous estimate 

^On the other extreme is a field with only K = 1 predictive 
state, i.e. the iid case. 
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e^"^. That is we approximate (10) with 'd^'^\t]'D,Si): thresholding version of e(^j ), so we can write the ex- 
pected log-hkeHhood in terms of W, 



N K 



" ^ ^' Q(")(W| W("))=^^u;,,.logP(x, I 



»=1 J=l 



where J {X, \ e^"\e~) = sj) is given below in (21). I^^^) 

The current W*^") can be used to update (condi- 

3 EM Algorithm for Predictive ^^^^^^^ probabilities in (i6) by 

State Space Assignment 4^'^ « /(^^ I - ,s,; w(")) (i8) 

Since e maps to 5, the hidden state variable Si and the ^-^(^^i I ^'^j ;W'' ■'^ (19) 

(20) 



X 



"parameter" e play the same role. This in turn results 
in similar E and M steps. Figure 2 gives an overview 

of the proposed algorithm. N 

where i) ivj"-* — J2iLi ^ij^ the effective sample size 

3.1 Expectation Step , ^ ^ _(„) , . , , 

of state Sj, ii) fij and are weighted mean and 

The E-step requires the expected log-likelihood covariance matrix estimators of the PLCs using the 

jth column of W("\ and iii) the PLC distribution is 
g(e I e^")) E5|x,.,(„)^(e; X>, 5,), (13) estimated with a weighted'' KDE (wKDE) 

where expectation is taken with respect to - w(n)A ^ V^w(")jr f\\ in 

P(5, = Sj I X>;e(")), the conditional distribution J[x^ \ = ^j]^ > = ^i^AW^^ ^ ^\\> 

of the hidden variable Si given the data V and the ' 
current estimate e^^\ Using (10) we obtain 
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xlogP(x, |e(")(^-) = s,) 



J21) 

where the weights are again the jth column of W'^"\ 
TV X and Kh-{\xi — a:||) is a kernel function with a state- 

Q(t I e'"^) = ^^^f*^' ~ *J I ^'"H^i^)) dependent bandwidth /ij. Por aU our numerical cal- 

- - culations we use a Gaussian kernel in the i? function 

density (). To obtain a good, cluster-adaptive band- 
width hj we only use those Xi with j = arg max;j. w^fe in 
(14) bw.ndrOO (hard-thresholding of weights; cf. Bcnaglia 
ct al. 2011). After estimation each in (18) must be 

As for lie:!)) we use KDEs and obtain an approxi- ,. , ^(n+i) ffi*"^^' 

,11 1-1 1-1 1 normalized, w,-. 'i ^ ' r^+n ■ 

mate expected log-likehhood T.y=i ' 

Ideally, we would use a non-parametric estimate for 
^ N K l^j^g PLC distribution, e.g., forest density estimators 

Q(")(e I e(")) = jm^P = *J I (Chow and Liu, 1968; Liu, Xu, Gu, Gupta, Laffcrty, 

and Wasscrman, 2011). Currently, however, such esti- 
mators are too slow to handle many iterations at large 
N, so we we model state-conditional PLC distributions 
as multivariate Gaussians. Simulations suggest that 
this is often adequate in practice. 



xlog/(x, |e(")(£r) 



(15) 



The conditional distribution of Si given its PLC and 
PLC, {Xiji^}, comes from Hayes's rule, 

p {s, = s, I x„ er) « p (x„ I s, = sj) p (5, = sj) 

= P,(X,)P,(C)P(5. = s,), 

(16) 



3.2 Approximate Maximization Step 

In a parametric model the M-step would solve 

e("+i) = arg max Q^") (e | e^") ) , (22) 



the second equality following from the conditional in- to improve the estimate. Starting from an initial guess 
dependence of Xi and given the state 5,. the EM algorithm iterates (13) and (22) until con- 

For brevity, let Wij := P {Si = Sj \ X^J~), forming vergence. 

an TV X weight matrix W, whose rows are proba- 3We also tried a hard-threshold estimator, but we found that 

bility distributions over states. This W.^ is the soft- the soft-threshold KDE performed better. 
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0. Initialization: Set n = 0. Split data V — 
{Xi,eY}fLi in Vtrain and Vtest- Initialize states 
randomly from {si, . . . , sk} -> Boolean W^"^ 

1. E-step: Obtain updated W("+i) via (18). 

2. Approximate M-step: Update mixture pdfs 
V{x, I S, = Sj) via (21) with W("+i). 

3. Out-of-sample Prediction: Evaluate out-of- 
saniple MSE for W("+^) by predicting FLCs from 
PLCs in Vtest- Set n = n + 1. 

4. Temporary convergence: Iterate 1-3 until 

||W(") - W("~i)|| < J (23) 

5. Merging: Estimate pairwise distances (or test) 

- dist ,fl"^)yj,k = l,..., K. (24) 

(a) li K > 1: determine (/mi"), fc(mi")) = 
argmin^^^j, djfe and merge these columns 

w(t„,^W^;i„,+W^.„. (25) 

Omit column fc^""'") from W^"l„, , set if = 
if — 1, and re-start iterations at 1. 

(b) If X = 1: return W* with lowest out-of- 
sample MSE. 



In nonparametric problems, finding an e("+i) that 
increases 

Q(")(e I (:(r>.)) is difBcuh, since wKDEs with 
non-zero bandwidth are not maximizing the likelihood; 
they are not even guaranteed to increase it. Optimiz- 
ing (15) by brute force isn't computationally feasible 
either, as it would mean searching state assign- 
ments (cf. Bordes et al. 2007). 

However, in our particular setting the parameter 
space and the expectation of the hidden variable are 
the same, in the sense that w.; is a soft-thresholding 
version of e(^~). Furthermore, none of the estimates 
above requires a deterministic e mapping, but they are 
all weighted MLEs or KDEs. Thus, like Benaglia, 
Chauveau, and Hunter (2009), we take the weights 
from the E-step, W^"+^\ to update each component 
distribution using (21). This in turn can then be 
plugged into (12) to update the likelihood function, 
and in (18) for the next E-step. 

The wKDE update does not solve (22) nor does it 
provably increase the log-likelihood (although simula- 
tions show that it often does so). We thus use cross- 
validation (CV) to select the best W* , and henceforth 
do not rely on an ever increasing log-likelihood as the 
usual stopping rule in EM algorithms (see Section 5 for 
details) . 

3.3 Data-driven Choice of K: Merge 
States To Obtain Minimal Suffi- 
ciency 

The advantage of the mixture model in (37) is that 
predictive states have by definition similar conditional 
distributions. Since conditional densities can be tested 
for equality by a nonparametric two sample test (or 
using a distribution metric), we can merge classes if 
they are close. We thus propose a data-driven auto- 
matic selection of K , which solves this key challenge 
in fitting mixture models: 1) start with a sufficiently 
large number of clusters, K^^^x < N; 2) test for equal- 
ity of distribution each time the EM reaches a (local) 
optimum; 3) merge until K ^ 1 (iid case) - step 5 in 
Fig. 2; 4) choose the best model W* by CV. 

4 Forecasting Given New Data 

The estimate W* can be used to forecast X given a new 
i~ . Integrating out Si yields a mixture distribution 

K 

P I = E IP = I • (^) • (26) 

As Pj (^x'j ^p(^X \ S ^ s^'j is independent of 

£~ we do not have to re-estimate them for each i~ , but 
can use the wKDEs in (21) from the training data. 



Figure 2: Mixed LICORS: nonparametric EM algo- 
rithm for predictive state recovery. 

The mixture weights Wj :~ V (^S ^ Sj \ are in 

general different for each PLC and can again be esti- 
mated using Bayes' rule (with the important difference 
that now we only condition on £^ , not on A): 

Wj(r;W*) cxP(r I S^Sj;W*^ xP(s' = Sj;W*) 

= Af[t;nh),%yM*)^^. (27) 

After re-normalization of w = [Wi, . . . ,Wj^), the pre- 
dictive distribution (26) can be estimated via 

K 

p (a = X I ^J2^r ^{^ = ^\s^ = sj-; w(*)) . 

(28) 

A point forecast can then be obtained by a weighted 
combination of point estimates in each component (e.g. 
weighted mean) , or by the mode of the full distribution. 
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In the simulations we use the weighted average from 
each component as the prediction of X. 

5 Simulations 

To evaluate the predictive ability of mixed LICORS in 
a practical, non-asymptotic context we use the follow- 
ing simulation. The continuous- valued (1 + 1)D field 
X(r, t) has a discrete latent state space (i(r, t). The ob- 
servable field X(r, t) evolves according to a conditional 
Gaussian distribution, 



and initial conditions: X{-, 1) — X{-, 2) 



if |d(r,t)| < 
otherwise, 
(29) 

(30) 



d{r,t) 



EL-2 ^(i- + i mod |S|,f-2) 



EL-i^(i- + imod|S|,t-l) 



(31) 



where [x\ is the closest integer to x. In words, Eq. 
(31) says that the latent state d{r,t) is the rounded 
difference between the sample average of the 5 nearest 
sites at i — 2 and the sample average of the 3 nearest 
sites at t — 1. Thus hp = 2 and c = 1. 

If we include the present in the FLC, (29) gives 
hf — 0, making FLC distributions one-dimensional. 
As d{r, t) is integer- valued and the conditional distri- 
bution becomes Af{0, 1) if \d{r, i) | > 4, the system has 
7 predictive states, {s_3, s_2, ■ • • , S2, S3}, distinguished 
by the conditional mean K{X{r,t) \ Sk) — k. Thus 
X{r,t) I Sfc =AA(fc,l), /e = -3,2,...,2,3. 

Figure 3 shows one realization of (29) - (31) for S — 
{1, . . . , 100} (vertical) and t = 1, . . . , T = 200 (left to 
right), where we discarded the first 100 time steps as 
burn-in to avoid too much dependence on the initial 
conditions (30). While the (usually unobserved) state 
space has distinctive green temporal traces and also 
alternating red and blue patches, the observed field is 
too noisy to clearly see any of these patterns.'* 

Figure 4 summarizes one run of mixed LICORS with 
K — 15 initial states and hp = 2. The first 100 time 
steps were used as training data, and the second half 
as test data. The optimal W* which minimized the 
out-of-sample weighted MSE occurred at iteration 502, 
with K — 9 estimated predictive states. The trace 
plots show large temporary drops in the log-likelihood 

*A11 computation was done in R (R Development Core Team, 
2010). 
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The state space d{Y, t) evolves with the observable field, « 



6 
4 

h2 




tiV' 



50 



100 
Time 



150 



200 



(b) state-space d(r, t) 
Figure 3: Simulation of (29) - (31). 
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Figure 4: Mixed LICORS with hp ^ 2 and Kmax = 15 
starting states, (top-left) nonparametric estimates of 
conditional predictive distributions P {X = x \ S = sj); 
(right) trace plots for log-likelihood (top) and MSE 
(bottom). 



and MSE whenever the EM reaches a local optimum 
and merges two states. After merging, the forecasting 
performance and log-likelihood quickly return to — or 
even surpass — previous optima. 

The predictions from W* in Fig. 5a show that mixed 
LICORS is practically unbiased - compare to the visu- 
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Figure 6: In-sample and out-of-sample forecast MSB 
comparison for mixed versus hard LICORS. 



ally indistinguishable true state space in Fig. 5b. The 
residuals in Fig. 5c show no obvious patterns except 
for a larger variance in the right half (training vs. test 
data) . 



0. Initialize field 1 - t)}%^. Set t = 1. 

1. Fetch all PLCs at time t: Pt 

{^-(r,t)}r6S-" 



2. For each l^{v,t) e Pt- 

(a) draw state Sj [S 

(b) draw X{y, t) 



(x I 5 = Sj) 



3 . If t- "\ ^iiiax ! set t 



t + 1 and go to step f . 



Otherwise return simulation {X{-,ty\^ 



Figure 7: Simulate new realization from spatio- 
temporal process on S"°^ x {!,..., imax} using true 
and estimated dynamics (use (27) in 2a; (21) in 2b). 



and independent realizations are essentially the same, 
which shows that mixed LICORS learns characteristic 
of the system, and is not biased towards random fluc- 
tuations in the training data. As the optimal weights 
often already are 0/1 assignments weighted prediction 
does only slightly better than unique state predictions. 



5.1 Mixed versus Hard LICORS 

Here we show that mixed LICORS indeed outperforms 
hard LICORS over multiple runs. We use 100 inde- 
pendent realizations of (29) ~ (31) and for each one we 
train the model on the first, and test it on the second 
half (future) . The optimal model is the one with lowest 
out-of-sample future MSE. 

In the simulations we use the EM as outlined in Fig. 
2 with i^max = 15 states, 10'^ maximum number of iter- 
ations. We keep the estimate W* with the lowest out- 
of-sample MSE over 10 independent runs. The first run 
is initialized with a K-means-|--|- clustering (Arthur and 
Vassilvitskii, 2007) on the PLC space; state initializa- 
tions in the remaining nine runs are assigned uniformly 
at random from S = {si, . . . , S15}. 

To test whether mixed LICORS accurately estimates 
the mapping e : £^ H> 5, we also predict FLCs of an 
independently generated realization of the same under- 
lying process. If the out-of-sample MSE for the inde- 
pendent field is the same as the out-of-sample MSE for 
the future evolution of the training data, then mixed 
LICORS is not biased towards random fluctuations 
in any given realization, but estimates characteristics 
of the underlying system. For comparison, we use 
weighted as well as unique state forecasts (using the 
argmax rule to determine a unique state). 

Figure 6 summarizes the results. Mixed LICORS 
greatly improves upon the optimal hard LICORS es- 
timates, with up to 66% reduction in out-of-sample 
MSE. Similarly to hard LICORS, the MSE for future 



5.2 Simulating New Realizations of 
The Underlying System 

Recall that all simulations use (30) as initial condi- 
tions. If we want to know the effect of different starting 
conditions, then we can simply use Eqs. (29) & (31) to 
simulate that system, since they fully specify the evolu- 
tion of the stochastic process. In experimental studies, 
however, researchers usually do not have these mech- 
anisms available, but finding them is one of the main 
purposes of the experiment in the first place. 

Since mixed LICORS estimates joint and conditional 
predictive distributions, and not only the conditional 
mean, it is possible to simulate a new realization from 
an estimated model. Figure 7 outlines this simula- 
tion procedure. We will now demonstrate that mixed 
LICORS can be used instead to simulate from different 
initial conditions without knowing Eqs. (29) & (31). 

For example. Fig. 8a shows a simulation using the 
true mechanisms in (29) & (31) with starting condi- 
tions X(-,l) = -1 and X(-,2) = ±3 € RI^I in alter- 
nating patches of ten times 3, ten times —3, ten times 
3, etc. (total of 10 patches since |S| = 100). The first 
couple of columns (on the left) are influenced by differ- 
ent starting conditions, but the initial effect dies out 
soon (since hp — 2) and similar structures (left to right 
traces) as in simulations with (30) emerge (Fig. 3). 

Figure 8b shows simulations solely using the mixed 
LICORS estimates in Fig. 4. While the patterns 
are quantitatively different (due to random sampling). 
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(a) estimated predictive states S{r, t) (b) true predictive states ^(r, t) 

Figure 5: Mixed LICORS model fit and residual check. 



(c) residuals 
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(b) simulations from estimated dynamics 



Figure 8: Simulating another realization of (29) & 
(31) but with different starting conditions. 

the qualitative structures are strikingly similar. Thus 
mixed LICORS can not only accurately estimate 
S{r,t), but also learn the optimal prediction rule (29) 
solely from the observed data X{r,t). 

6 Discussion 

We introduce mixed LICORS, a nonparametric EM- 
like algorithm for estimating the predictive state space 
of a spatio-temporal process. Mixed LICORS is a prob- 
abilistic generalization of hard LICORS and can thus 
be easily adapted to other statistical settings such as 



classification or regression. Simulations show that it 
greatly outperforms its hard-clustering predecessor. 

However, similarly to other state-of-the-art nonpara- 
metric EM algorithms (Bordes et al., 2007; Hall et al., 
2005; Mallapragada, Jin, and Jain, 2010) theoretical 
properties of our EM are not yet well understood. In 
particular, the nonparametric estimation of mixture 
models poses identifiability problems (Bcnaglia et al., 
2009, §2 and references therin). In this work we demon- 
strated that in practice it does not suffer from identi- 
fiability problems, and can outperform (identifiable) 
hard-clustering methods. 

We also demonstrate that mixed LICORS can learn 
spatio-temporal dynamics from data, which can then 
be used for simulating new experiments. Thus mixed 
LICORS can in principle make a lot of expensive, time- 
and labor-intensive experimental studies much more 
manageable and easier to plan. Such an optimal statis- 
tical model can potentially learn spatio-temporal dy- 
namics already after a few experiments. Once learned, 
researchers can use the estimates to simulate their sys- 
tem from different starting conditions within a couple 
of minutes rather than waiting for their empirical ex- 
periments to finish in hours, days, or months. 
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APPENDIX SUPPLEMENTARY 
MATERIAL 

A Proofs 

Here we will show that the joint pdf of the entire field 
conditioned on its margin (Figure 9, gray area) equals 
the product of the predictive conditional distributions. 

Proof of Proposition 2. 3. For simplicity of notation, 
we choose the index set i = 1, . . . ,N in such a rela- 
tion to the spatio-temporal grid(s,t), that the PLC of 
ii cannot contain Xi^ if Z2 > ii- This can be guar- 
anteed by iterating through space-time in increasing 
order over time (for fixed t the order in space does not 
matter). Formally, 

(s,t) ,s e e T ^ (i(t_i).|s|+i, . . . ,i(t-i).|s|-ns|)) 
= it-l)-\S\ + {l,...,\S\). 

(32) 

For consistent notation with the main text, assume 
that we observed the process on an extended grid S x T, 
where S D S and f = {-{hp - 1), . . . , 0} U T. Let the 
extended field have N > N observations, Xi, . . . , Xf^. 
The margin M are all X(s,u), (s,m) G S x T, that 
do not have a fully observed past light cone - see (7) 
for a formal definition. The size of M depends on the 
past horizon hp as well as the speed of propagation c, 
M = M(/ip,c). 

In Figure 9 the extended field Xi, . . . , "lives" on 
the red and gray area, all points with a fully observed 
PLC, Xi,. . . ,Xn, are on the red grid. The PLCs of 
points in the margin (gray) extend into the unobserved 
(blue) area; points in the red area have a PLC that lies 
fully in the red or partially in the gray, but never in the 




Figure 9: Margin of spatio-temporal field in {1 + 1)D. 



blue area. As can be see in Fig. 9 the margin at each 
t is a constant fraction of space, thus overall M grows 
linearly with T; it does not grow with an increasing S, 
but stays constant. 

For simplicity of notation, assume that Xi , . . . , X]\f 
are from the truncated (red) field, such that all their 
PLCs are observed (they may lie in M), and the re- 
maining N — N XjS lie in M (with a PLC that is 
only partially unobserved). Furthermore let Xi := 
{Xi,...,Xk}. Thus 

¥[{x{s,t) |(s,i)eSxf}) =p(xf) 

= P(xf ,M) 

= P (X^ I M) P (M) 

The first term factorizes as 
P (Xf I M) 

= P {Xn I X^-\M) P (Xf -1 I M) 
^P{Xn\£],U {X^-\M} \ {£],}) P (Xf-i I M) 

= ¥{Xn\ ej,) P (xf -1 1 M) 

where the second-to-last equality follows since by (32), 
C {Xk I 1 < fc < TV} U M, and the last equality 
holds since Xi is conditional independent of the rest 
given its own PLC (due to limits in information prop- 
agation over space-time). 
By induction, 

JV-l 

P (Xi, . . . , Xat I M) = n IP {^N-j I iN-j) (33) 

3=0 
N 

= l[p{X,\e-) (34) 

1=1 

This shows that the conditional log-likelihood maxi- 
mization we use for our estimators is equivalent (up to 
a constant) to full joint maximum likelihood estimation 
(MLE). □ 

B Predictive States and Mix- 
ture Models 

Another way to understand predictive states is as the 
extremal distributions of an optimal mixture model 
(Lauritzcn, 1974, 1984). 

To predict any variable L"*", we have to know its 
distribution P(L+). If, as often happens, that distri- 
bution is very complicated, we may try to decompose 
it into a mixture of simpler "base" or "extremal" dis- 
tributions, P(L+ I 0), with mixing weights Tr{0), 

P (L+) = / 7r(6l)P {L+ \e)de . (35) 
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The familiar Gaussian mixture model, for instance, 
makes the extremal distributions to be Gaussians 
(with 6 indexing both expectations and variances) , and 
makes the mixing weights tt{9) a combination of delta 
functions, so that P(_L+) becomes a weighted sum of 
finitely-many Gaussians. 

The conditional predictive distribution of | £^ in 
(35) is a weighted average over the extremal conditional 
distributions P (L+ | 9,e~), 

p (L+ \e-) ^ J Tr{e\r)p {l+ \ e, r) de (36) 

This only makes the forecasting problem harder, unless 

p(L+ I e,t-')Ti{9 1 e.-) p(l+ 1 9{e.-)^ 6{9-9{e-)), 

that is, unless 9{£~) is a predictively sufficient statis- 
tic for L+. The most parsimonious mixture model is 
the one with the minimal sufficient statistic, 9 — e(^~). 
This shows that predictive states are the best "param- 
eters" in (35) for optimal forecasting. Using them, 

K 

P (L+) = ^ P {e{n ^ s,) ¥ {L+ I ein = s,) (37) 

K 

= Y,n,{n-V,{L+) , (38) 

where iTj ) is the probability that the predictive state 
of £- is Sj, and Vj (i+) =V {L+ \ S = sj). Since each 
light cone has a unique predictive state. 



(39) 



I otherwise. 
Thus the predictive distribution given £^ is just 

K 

P {L+ I £-) = ^^ (C) • (^^) - i^^) ■ 

(40) 

Now the forecasting problem simplifies to mapping 
£~ to its predictive state, e{£^) — sj; the appropriate 
distribution- valued forecast is pj(L^), and point 
forecasts are derived from it as needed. 



This mixture-model point of view highlights how 
prediction benefits from grouping points by their pre- 
dictive consequences, rather than by spatial proxim- 
ity (as a Gaussian mixture would do). For us, this 
means clustering past light-cone configurations accord- 
ing to the similarity of their predictive distributions, 
not according to (say) the Euclidean geometry. Mixed 
LICORS thus learns a new geometry for the system, 
which is optimized for forecasting. 
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